More and Better Division by Seven.........Bob Sander-Cederlof

I can think of at least three good reasons we need a good subroutine for dividing by seven.  We need it in computations involving the day of week.  We need it in hi-res graphics programs to calculate the byte and bit for a particular pixel between 0 and 279 for normal hi-res, or between 0 and 559 for double hi-res.  Lastly, the new protocol converter interface used in connection with the Unidisk 3.5 works with packets of up to 767 bytes which are made up of a number of 7-byte groups.

In looking through the assembly listing of the new //c ROMs, which come with the Unidisk 3.5 update, I noticed a divide-by-seven subroutine at $CB45-CBAF.  The code divides the buffer size, which can be up to $2FF, by seven, and saves both the quotient and the remainder.  The code looks too large and too slow and too complicated ... in other words, it looks like a challenging assignment.  My transposition of the //c code follows, and as I count cycles it takes from 133 to 268 cycles depending on the value of the dividend.  The code and tables take 71 bytes in the //c ROM.

While I was musing on the possibilities, Michael Hackney called me from Troy, New York.  He wondered if we were interested in publishing his fast 65802 routine for dividing by seven.  Michael uses his in a speedy double hi-res program.  He divides values up to 559 ($22F) by seven, keeping both the quotient and remainder, in 66 cycles.  Michael's subroutine itself is short (37 bytes), but he uses a 140-byte table to achieve the speed.  Adding another 84 bytes to the tables extends the range to handle dividends up to 895 ($37F).

(In all the times and lengths given here, I am not counting the JSR-RTS cycles nor the RTS byte.  I assume the code is critical enough that it would be placed in-line in actual use, rather than made into a JSR-called subroutine.  I am also not counting any overhead I added to switch from 65802 mode to 6502 and back, as this was only added due to my test program being in 65802 mode.  All of the subroutines use page zero for variable and temporary storage.  They would be longer and slightly slower if the variables and temporaries were not in page zero.)

Yesterday I spent the whole day dividing by seven.  I came up with two new subroutines:  one for the 65802, and one for a normal 6502.  They are both small and fast.  First I tackled the 65802 version, and based it on multiplying by 1/7 as a binary fraction.  It came out 39 bytes long, executing in 64 cycles.  This version uses a fudge factor; the largest dividend it can handle is 594 ($252).  By using alternate code to extend the precision, numbers up to 895 ($37F) can be handled.  The extended version takes the same number of bytes, but 9 cycles longer.

Finally, I wrote a normal 6502 version.  Strangely enough, it came out only 60 bytes long and only 76 cycles!  Makes me wonder if I couldn't do better in the 65802, given another day or two.  The 6502 version handles dividends up to 1023 ($3FF).  It would be two bytes shorter if the range was restricted to $2FF.

Here is a table summarizing the size, timing, and dividend range for the various subroutines:

                       bytes   cycles   dividend
                       -------------------------
             //c ROM     71   133-268    0-$2FF
       Hackney 65802    177      66      0-$22F
        RBSC 65802-1     39      64      0-$252
        RBSC 65802-2     39      73      0-$37F
           RBSC 6502     60      76      0-$3FF

The listing which follows includes all five versions, plus a testing program.  The testing program runs through the entire range from $3FF down to 0.  After doing the division by the selected method, a check subroutine tests for a valid remainder (a number less than 7); it further tests that quotient*7 + remainder = the original dividend.  If not, the dividend, quotient, and remainder are all printed in hexadecimal.  If they are correct, the next dividend is tried.  A keyboard pausing subroutine allows you to stop the display momentarily and/or abort the test run.

Lines 1020-1060 control some conditional assembly which selects which division method to use.  By changing the value of VERSION in line 1020 I can assemble any one of the four routines.  I used the "CON" listing option in line 1180 (which is not itself listed: it is "1180    .LIST CON")  so that you can see what the un-assembled lines of code are.  Other conditional code at lines 1720-1860 and 4010-4050 selects options mentioned above.

Lines 1200-1540 control each test run.  I wrote this program using 65802 instructions, although it would not be difficult to re-write it for a plain 6502.  Lines 1210-1220 enter the 65802 Native Mode, and lines 1520-1530 leave it.  It is VERY IMPORTANT to be sure you do not exit a program and return to normal Apple software while still in the Native Mode.  The most fantastic things can happen if you forget!

Lines 1580-1950 are my 65802 version.  This entire subroutine is executed in the 65802 native mode, with the M-bit clear so the A-register operations are 16 bits wide.  The value 1/7 in binary is .001001001001001...forever.  Multiplying by that number should give the same answer as dividing by seven.  It also has the surprising side effect that the three bits after the "quotient" portion of the product will be equal to the "remainder".  The values of the fractions from 0/7 to 6/7 are just nice that way:

              repeating  same value   the first
     fraction  decimal     in hex     three bits
       0/7    .000000     .000          000
       1/7    .142857..   .249..        001
       2/7    .285714..   .492..        010
       3/7    .428571..   .6DB..        011
       4/7    .571428..   .924..        100
       5/7    .714285..   .B6D..        101
       6/7    .857142..   .DB6..        110

Wow!  Isn't that neat?  More justification for the numerologists who claim that seven is the "perfect" number.
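
The table is easy to check by computing the expansions directly.  The sketch below is in Python rather than assembly, purely as a checking aid; the function name seventh_digits is made up for this illustration and is not part of the listing.

```python
from fractions import Fraction

def seventh_digits(k):
    """Return (first three hex digits, first three bits) of the fraction k/7."""
    f = Fraction(k, 7)
    hex_digits = format(int(f * 16**3), "03X")  # truncate after three hex digits
    first_bits = format(int(f * 2**3), "03b")   # truncate after three bits
    return hex_digits, first_bits

# Print one row per fraction, in the same order as the table above.
for k in range(7):
    hex_digits, first_bits = seventh_digits(k)
    print(f"{k}/7  .{hex_digits}  {first_bits}")
```

The three-bit column equals k because 8k/7 = k + k/7, and k/7 is always less than one.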

Now it remains to find the most efficient way to multiply by that fraction.  The method I came up with first forms the product for .01000001 (lines 1600-1670).  Then I divide that result by 8, which is the product for .00001000001 (lines 1680-1700).  Adding the two products in line 1710 gives me the product for .01001001001 (approximately 2/7).  Dividing that by two gives me an approximation for the division by seven.  The code that follows in lines 1720-1800 is not assembled, because of the ".DO 0" line.  What it does is extend the multiplication to include one more partial product.  The shortest way I could think of to get that little number is demonstrated in the code you see.  The extra precision makes my subroutine work for dividends up to $37F.  It fails above that value because of overflow during the multiplication.  If I leave out the extra precision, the subroutine gets the wrong answers for some numbers at each end of the range.  By adding a "fudge factor" (a trick learned in college laboratory assignments to force experimental results to fit the laws of science), I can make all the dividends up to $252 work.  The fudge factor adds $000A for values in the A-register of $8800 or more, and only $0008 for values below $8800.

Line 1870 is the division by two mentioned above.  Lines 1880-1940 shift the first three bits of the remainder over to the correct position in the lower byte of the A-register.  As I was writing the previous sentence, it suddenly struck me that the second set of three bits might be the same as the first set, if my multiplications happened to be precise enough.  I went back to the assembler, changed line 1720 to ".DO 1" so the more precise version would assemble, and then replaced lines 1910-1930 with "1910    AND #7".  Guess what!  It worked!  One byte shorter and four cycles faster!  That makes it 38 bytes long, and only 69 cycles.
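
The shift-and-add idea can be modeled in a few lines of Python.  This is only a sketch, not a transcription of the listing: Python integers never overflow, so it says nothing about the 16-bit range limits discussed above, and the extra low-order correction term is my own assumption standing in for the extra partial product and fudge factor.  The name div7_by_reciprocal is hypothetical.

```python
def div7_by_reciprocal(n):
    """Divide n (0..1023) by 7 with shifts and adds; returns (quotient, remainder)."""
    # Every value below is fixed point with 16 fraction bits.
    t = (n << 14) + (n << 8)      # n * .01000001 (binary)
    p = t + (t >> 3)              # n * .01001001001, roughly n * 2/7
    # Assumption for this sketch: two more low-order terms, which round the
    # multiplier up slightly so truncation never drops the quotient by one.
    p += (n << 2) + (n << 1)
    quotient = p >> 17            # halve (2/7 -> 1/7), then drop the fraction bits
    remainder = (p >> 14) & 7     # the three bits just below the quotient
    return quotient, remainder
```

Comparing the result with Python's divmod(n, 7) for every n from 0 through 1023 is a quick way to exercise it.  A 16-bit accumulator has far less headroom than this model, which is why the 65802 routines stop at $252 or $37F.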

Next is my 6502 version, lines 1970-2370.  The first four lines simply save the current state of the M and X bits, and the mode, and switch to 6502 emulation mode.  They are matched by lines 2340-2360, which restore the mode and state.  These will work regardless of what mode and state the machine was in when the subroutine was called.  Since the subroutine would normally only be used in a 6502, you would leave out lines 1980-2010 and 2340-2360.  I did not count them when timing the code.

Back in December of 1984 I wrote in these pages of a nifty way to divide a one-byte value by seven.  I used that method here, for dividing the low-order byte of the dividend.  I then computed the remainder by multiplying the quotient by 7 and subtracting it from the dividend.  Saving that quotient and remainder, I used a table lookup to determine the quotient and remainder of the high-order byte of the number.  Since it could only have the values 0-3, the tables are very short.  Then I add the two remainders together, modulo 7; and the two quotients, remembering the carry from the remainder if any.
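
The overall plan of the 6502 version can be sketched in Python as follows.  The names are hypothetical, and the one-byte division is shown as a plain integer divide standing in for the December 1984 routine, which is not reproduced here.  The tables hold the true remainders, not the adjusted values used in the listing.

```python
# Quotient and remainder of $000, $100, $200, $300 divided by 7.
HIGH_QUOTIENT = [0, 36, 73, 109]
HIGH_REMAINDER = [0, 4, 1, 5]

def div7_split(n):
    """Divide n (0..$3FF) by 7 one byte at a time; returns (quotient, remainder)."""
    high, low = n >> 8, n & 0xFF
    quotient = low // 7                # stand-in for the one-byte divide routine
    remainder = low - 7 * quotient     # multiply back and subtract
    quotient += HIGH_QUOTIENT[high]
    remainder += HIGH_REMAINDER[high]
    if remainder >= 7:                 # sum is at most 6 + 5, so one correction is enough
        remainder -= 7
        quotient += 1                  # the "carry" from the remainder
    return quotient, remainder
```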

Lines 2030-2170 are essentially the same as published in that December issue of AAL, except for the addition of lines 2130, 2140, and 2160.  With those added lines I am saving a few steps in the multiplication by seven that I must do.  Lines 2190-2200 finish the multiplication by seven, by adding the *2 and *4 values saved above.  Lines 2210-2220 form the complement of the value, so I can subtract by adding.  Normally a complement is formed by:

       EOR #$FF
       CLC
       ADC #1

I do the same with two fewer bytes and cycles here by preceding the addition at line 2230 with SEC rather than the usual CLC.  I saved a byte and two cycles by storing one less than the actual remainder in the table of remainders at line 2400.
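
The identity behind the SEC trick is that adding the one's complement with the carry set is the same as subtracting.  Here is a small Python model of an 8-bit ADC to illustrate it; the function names are made up, and which operand the listing complements is not shown here, so the argument order is an assumption.

```python
def adc(a, operand, carry):
    """Model an 8-bit add-with-carry: returns (result, carry_out)."""
    total = a + operand + carry
    return total & 0xFF, total >> 8

def subtract_by_adding(a, b):
    """Compute (a - b) mod 256 as a + (b EOR $FF) with the carry set first."""
    result, _ = adc(a, b ^ 0xFF, 1)   # SEC supplies the +1 that ADC #1 would have added
    return result
```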

Lines 2420-2640 are called to print out the results when they don't meet expectations.  Notice lines 2430-2460 and 2610-2630, which make sure I am in the correct state and mode.  The monitor routines will not work correctly in 16-bit state, and may not work correctly in 65802 Native mode.

Lines 2660-2920 check the results.  The subroutine returns with carry clear if the quotient and remainder are correct, or carry set if they are not.  I check both by multiplying the quotient by seven and adding the remainder to see if the result equals the dividend, and I also make sure the remainder is less than seven.  It is possible to get an answer with the quotient one less than it should be and a remainder of 7, so I had to test the remainder.
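
In Python terms the check amounts to the following sketch (the function name is hypothetical; the listing reports the result in the carry flag, not as a return value):

```python
def check(dividend, quotient, remainder):
    """Return True when the answer is valid (the listing signals this with carry clear)."""
    return remainder < 7 and quotient * 7 + remainder == dividend

# 14 / 7 reported as quotient 1, remainder 7 satisfies the multiply-back test
# but is still wrong, which is why the remainder is tested separately.
print(check(14, 2, 0), check(14, 1, 7))
```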

The PAUSE routine checks to see if any key has been typed.  If so, and if it is not a <RETURN>, it waits until another key is typed.  Note that I had to set 8-bit mode, to prevent the softswitch at $C011 from being switched.  This also makes the CMP work properly.  Otherwise the LDA $C000 would get two copies of the same character in the two halves of the A-register.

Lines 3060-3540 are essentially the code from the new //c ROMs.  I re-arranged it a little, to make a stand-alone routine within my test-bed, and I changed labels and variable names.  Apple uses two sets of tables.  One gives quotients and remainders for 0, $100, and $200 (the high byte of the dividend).  The other gives quotients and remainders for 0, $08, $10, $20, $40, and $80.  A loop runs 5 times to add in the quotients and remainders for bits 3-7 of the dividend, and then fakes one more trip to add in the value of bits 0-2.  Not efficient!
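
For comparison, here is a Python sketch of the table-per-bit scheme just described.  It is a model of the idea only: the names are invented, the tables are generated with divmod instead of being copied from the ROM, and the order in which the ROM visits the bits is an assumption.

```python
# Quotient and remainder for $000, $100, $200 (indexed by the high byte).
HIGH_TABLE = {h: divmod(h << 8, 7) for h in range(3)}
# Quotient and remainder for $08, $10, $20, $40, $80 (indexed by bit number).
BIT_TABLE = {bit: divmod(1 << bit, 7) for bit in range(3, 8)}

def div7_bitwise(n):
    """Divide n (0..$2FF) by 7 by summing table entries for each set bit."""
    quotient, remainder = HIGH_TABLE[n >> 8]
    for bit in range(3, 8):            # five trips, one per bit 3-7
        if n & (1 << bit):
            q, r = BIT_TABLE[bit]
            quotient += q
            remainder += r
            if remainder >= 7:         # keep the running remainder below 7
                remainder -= 7
                quotient += 1
    remainder += n & 7                 # the extra trip for bits 0-2
    if remainder >= 7:
        remainder -= 7
        quotient += 1
    return quotient, remainder
```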

Michael Hackney's code is in lines 3560-4080.  I'll quote from his letter.

"Apple hi-res graphics characteristically involve various calculations to determine the exact display address from a given X,Y pair.  Typically, the vertical position (Y) base address is found by table look-up.  The horizontal, or X, position is determined by dividing by 7 (since there are seven pixel bits per byte in the hi-res screen).  The integer portion of the division is the byte offset from the base address, and the remainder is the position in the byte.  Brute calculation (which is slow for graphics routines) or table lookup (which takes a lot of space) is used to do the division.  Table lookup is usually used in good graphics programs.  Hi-res graphics require two 280-byte tables, one for quotient and one for remainder.  Double hi-res requires tables twice as big.  My interest in 65802/816 double hi-res graphics drivers has prompted me to find a serviceable divide-by-seven which is quick and doesn't require more than one page of memory.

"The 65802/816 16-bit operations are ideally suited for this task.  Larger numbers can be easily manipulated and table lookup can retrieve 2 bytes of data at once.  My routine uses both of these techniques to perform its duty.  It divides the original number by eight before doing any table lookup (this keeps the table smaller).  Then it multiplies both the quotient and remainder retrieved from the table by 8.  The resulting remainder is added to the original lower three bits (the ones shifted out when I divided by 8), and I look into the table again.  The first quotient is added to the second quotient, and it is finished.  The table only takes 140 bytes, storing quotients and remainders for numbers up to 69.  Everything fits in a page with room to spare.

"As an extra bonus, I included a small routine which generates the table in situ.  The area occupied by the table generator can be used for data storage once the table is built.  It takes longer to load a table from disk than it does to compute one, and the generator disappears after use, so this is the best way to do it."

In order to get the greatest speed, Michael's table should reside entirely in the same page of memory.  That is why I included line 4100, which justifies the table to the beginning of the next page.
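
The two-lookup scheme in Michael's letter can be sketched in Python like this.  The names are hypothetical, and the table is simply built with divmod; it is a model of the arithmetic, not of his 65802 code.

```python
# One (quotient, remainder) pair for each number 0..69: 70 two-byte entries.
TABLE = [divmod(i, 7) for i in range(70)]

def div7_two_lookups(n):
    """Divide n (0..559) by 7 with two lookups in a 70-entry table."""
    q1, r1 = TABLE[n >> 3]             # first lookup on n / 8
    second = r1 * 8 + (n & 7)          # scaled remainder plus the three bits shifted out
    q2, r2 = TABLE[second]             # second lookup; second is at most 6*8 + 7 = 55
    return q1 * 8 + q2, r2
```

It works because n = 8*(7*q1 + r1) + low bits = 7*(8*q1) + second, so only the small value "second" is left to be divided.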

So here you have four great answers to the challenge.  Now it's your turn!
